Avoiding Data Graveyards: Deriving an Ontology for Accessing Heterogeneous Data Collections

Author

  • Christian Chiarcos
Abstract

In this paper, I describe the derivation and practical application of an ontology of word classes manually derived from four different sources:
– the EAGLES recommendations for the morphosyntactic annotation of corpora,
– several language-specific or task-specific tag sets for part-of-speech tagging,
– the typologically oriented SFB632 guidelines for part-of-speech tagging, and
– the General Ontology for Linguistic Description (GOLD).

The resulting ontology is intended to provide an integrated representation of, and access to, terminologically heterogeneous resources. It will be applied as part of a sustainable archive of linguistic resources to be developed by the project “Sustainability of Linguistic Data”, a recently started joint initiative of three German special research centers. While the first phase of ontology development has focused on terminology for part-of-speech (POS) tagging, which requires hand-crafted methods, a possible extension towards the semi-automatic integration of syntactic annotation will be sketched as an outlook.

1 Background and Motivation

For researchers unfamiliar with the specific usage and origins of the terms applied in the creation of a data source such as a corpus, the variety of abbreviations, terms, tags and possibly conflicting definitions can be confusing, and working through them is time-consuming. In the worst case, the effort necessary for a closer examination of the data will prevent later generations of researchers from working with a data collection. The problem becomes even more apparent for very large collections of heterogeneous corpora.
That is why it is an urgent task for the unified treatment of such collections to identify and document commonalities as well as differences in the terminology used: the integration of information on the linguistic terminology can be seen as a core aspect of the sustainable maintenance of linguistic data.

Such questions are addressed by the project “Sustainability of Linguistic Data”, a collaborative initiative formed by three research centers, SFB 441 (Tübingen, “Linguistic Data Structures”), SFB 538 (Hamburg, “Multilingualism”) and SFB 632 (Potsdam/Berlin, “Information Structure”), to provide means to guarantee the long-term availability and accessibility of the collected resources. The project is intended to develop sustainable solutions for the creation, maintenance, accessibility and distribution of linguistic resources. One of our primary aims is to provide the means to ensure the long-term availability of the data collections. Along with technical aspects, as discussed by Dipper et al. [5], this goal involves creating thorough documentation for the corpora in order to provide easy access for non-specialised users. This includes metadata about the corpora themselves, such as type of data, formats, standards and levels of annotation. Furthermore, the terminology used in the annotations has to be handled with sustainability in mind.

| meta tag sets and multilingual tag sets | language-specific tag sets | languages | granularity |
| n/a | Tibetan tag set | Tibetan | ≥ 36 tags |
| EAGLES (generalization over existing tag sets for European languages) | SUSANNE | English | ≈ 420 tags |
| EAGLES | STTS, 3 variants | German | 54 (718) tags |
| EAGLES | MENOTA | Old Norse | ≈ 13055 tags |
| MULTEXT-East (adaption of EAGLES) | Russian tag set | Russian | ≥ 877 tags |
| SFB632 annotation standard (designed for typological research) | n/a | 26 languages | ≈ 79 tags |
| SFB538/E2 tag set (reduced tag set for acquisition studies) | n/a | German, Romance, Basque | ≥ 8 tags |

Table 1. Tag sets and meta-tag sets in the SFBs.
Focusing on the tag sets used for part-of-speech (POS) tagging in Tübingen, Hamburg and Potsdam/Berlin, we find that our research centers create and use POS-annotated corpora for 29 languages or language stages, annotated according to nine tag sets or tag set variants, cf. Tab. 1. With this amount of data, several problems can be identified that hinder direct access to the data by means of these tag sets: (i) tag names are cryptic, arbitrary and appear in idiosyncratic variants, (ii) tags with the same name have different community- or project-specific definitions, (iii) tag definitions can be extremely complex or missing, (iv) tag sets are of differing granularity. To overcome these problems, it is necessary to provide a consistent terminology and to refer to this terminological backbone in the definition of annotation structures.

2 Towards an Ontology of Word Classes

A classical solution to the problem is the “standardisation approach” as employed by the EAGLES recommendations [13]. There, standards for POS tag sets have been formulated (further referred to as the “EAGLES meta scheme”), which are intended to increase tagging accuracy and the comparability of automatic taggers and tag sets for most European languages. In a bottom-up approach, existing tag sets for several European languages have been considered, and commonly used terms and categories have been identified. As a result, 13 obligatory categories were postulated. For each category, a list of features has been assembled that a standard-conformant tag set should respect. Accordingly, the “EAGLES meta tag set” is constituted as the set of reasonable combinations of categories (main tags) and features.

The standardisation approach has several disadvantages. Language-specific conceptualisations have to be integrated into the meta-scheme. As a consequence, the complexity of every standard-conformant scheme is projected onto the meta-scheme.
Further, the outcome of the bottom-up process in the case of EAGLES was not a full terminological resource, but only a list of terms. As long as no definitions are included in the description of the standard, community-specific usage of terms can lead to contradictory interpretations of the corresponding tags. This certainly runs counter to any effort of standardisation. Finally, the solution is not scalable, as it cannot be applied directly to non-European languages [12].

To overcome the deficits of the standardisation approach and the definition of meta tag sets, the application of an ontology similar to the GOLD approach [7] can be considered. In contrast to the EAGLES initiative, which was dedicated exclusively to European languages, GOLD, as developed in the E-MELD project, emphasized aspects of universality and scalability from the beginning. Instead of providing a generalisation of tag sets for a fixed range of languages, it aimed to cover the full typological variety as far as possible. Finally, it took a different starting point due to its orientation towards the documentation of endangered languages.

As opposed to this, our joint initiative aims to achieve a unified representation of and access to existing resources, most of which deal with European languages. Accordingly, we suggest developing an ontology based on established meta-schemes such as EAGLES, i.e., we do not plan a direct adaptation of GOLD. For standard-conformant tag sets, then, the linking with this ontology becomes trivial. Still, as these meta-schemes suffer from the problems of standardisation approaches in general, we suggest a harmonisation between our EAGLES-based ontology and GOLD. Accordingly, the terms used in EAGLES are provided with a formal definition retrievable from the mapping between EAGLES and GOLD. Finally, other non-EAGLES conformant tag sets will be integrated into the ontology. Thus, our terminological backbone will be created in a three-step methodology:
1. derive an ontology from EAGLES,
2. harmonise this ontology with GOLD, and finally
3. integrate other non-EAGLES conformant tag sets.

3 Mapping Tags to Concepts

In this section, I sketch the derivation of an ontology for the case of Nouns and the German tag set STTS [16]. This procedure has been performed for a slightly reduced version of the EAGLES meta scheme. We have developed a prototype of the ontology based on the EAGLES meta scheme, which we refer to as the “EAGLES ontology”. This ontology, implemented in OWL/DL, covers the obligatory categories of the EAGLES meta scheme and those recommended features pertaining to inherent properties of part-of-speech tags, i.e. those which serve to classify different functions of words rather than merely morphological distinctions. The linking with GOLD has been investigated and, for testing purposes, other non-EAGLES conformant tag sets have been mapped onto this rudimentary ontology as well. Further, a first prototype of a module which supports ontology-sensitive querying has been developed. Additional grammatical features will be considered in the next version of this prototype.
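The idea of linking tags to concepts and querying by concept rather than by tag can be illustrated with a minimal Python sketch. It is not the OWL/DL prototype described above. The concept names (`Word`, `Noun`, `CommonNoun`, `ProperNoun`), the coarse tag set called `toy`, and the helper functions are all hypothetical and chosen only for illustration. The only real tags used are the STTS tags `NN` (common noun), `NE` (proper noun) and `ADJD`.

```python
# Hypothetical miniature concept hierarchy (child -> parent).
# The concept names are illustrative and not taken from the EAGLES ontology.
SUPERCLASS = {
    "Noun": "Word",
    "CommonNoun": "Noun",
    "ProperNoun": "Noun",
}

# Links from tags to concepts, one table per tag set.
# "STTS" uses the real STTS tags NN and NE; "toy" is an invented coarse tag set.
TAG_TO_CONCEPT = {
    "STTS": {"NN": "CommonNoun", "NE": "ProperNoun"},
    "toy": {"N": "Noun"},
}


def ancestors(concept):
    """Return the concept itself followed by all of its superconcepts."""
    chain = [concept]
    while concept in SUPERCLASS:
        concept = SUPERCLASS[concept]
        chain.append(concept)
    return chain


def tags_for(concept, tagset):
    """Return the tags of `tagset` linked to `concept` or one of its subconcepts."""
    links = TAG_TO_CONCEPT[tagset]
    return sorted(tag for tag, linked in links.items() if concept in ancestors(linked))


def query(corpus, concept):
    """Return the tokens whose tag is subsumed by `concept`.

    `corpus` is a list of (tagset, token, tag) triples, so tokens annotated
    with different tag sets can be searched with one concept-level query.
    """
    return [token for tagset, token, tag in corpus if tag in tags_for(concept, tagset)]


# Tiny invented example corpus mixing two tag sets.
CORPUS = [
    ("STTS", "Haus", "NN"),
    ("STTS", "Berlin", "NE"),
    ("STTS", "schnell", "ADJD"),  # not linked to any concept in this sketch
    ("toy", "house", "N"),
]

if __name__ == "__main__":
    print(query(CORPUS, "Noun"))
    print(query(CORPUS, "ProperNoun"))
```

In this sketch, a query for `Noun` retrieves tokens from both tag sets, whereas a query for `ProperNoun` matches only the STTS-tagged token, because the coarse `toy` tag `N` cannot be resolved to the more specific concept. This mirrors the problem of differing granularity noted in Section 1.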




Publication date: 2006